In [16]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import numpy as np

UN Population Projection Errors¶

In this notebook, we will attempt to estimate the error in the UN City Population Predictions for African cities. The predictions are taken from the 2018 Revision of World Urbanization Prospects. The predictions were made in 2018 and predict city populations from 2019 to 2035. To test these predictions, we sourced more recent population estimates that were made after 2018.

Load Data¶

In a previous notebook, we merged the UN population predictions with the more recent city population estimates that we found, e.g. if we found a census population estimate of Nairobi for 2021, then we matched it with the 2021 UN population prediction for Nairobi.

Below we load in this data:

In [17]:
cities = pd.read_csv('UN_projections_compared.csv')
cities.head()
Out[17]:
Region Country_Code Country_or_area City_Code City data_sources_UN City_Definition Latitude Longitude year ... city_cleaned UN_proj_error UN_proj_error_% mag_UN_proj_error mag_UN_proj_error_% latest_census_data_UN latest_estimate_data_UN most_recent_UN_source years_since_latest_UN_source data_source_type
0 Africa 854 burkina faso 23191 bobo-dioulasso Censuses of 1951, 1960, 1975, 1985, 1996 and 2... City Proper 11.17715 -4.29790 2019 ... bobo-dioulasso 19450.0 0.021041 19450.0 0.021041 2006 2015 Estimate 3 Population Census
1 Africa 854 burkina faso 23192 ouagadougou Censuses of 1951, 1960, 1975, 1985, 1996 and 2... City Proper 12.36423 -1.53834 2019 ... ouagadougou 237454.0 0.089513 237454.0 0.089513 2006 2015 Estimate 3 Population Census
2 Africa 140 central african republic 20410 bangui Censuses of 1955, 1964, 1975, 1988 and 2003; E... Urban Agglomeration 4.36122 18.55496 2023 ... bangui -601665.0 -0.627823 601665.0 0.627823 2003 1966 Census 15 Population Estimate
3 Africa 178 congo 20848 brazzaville Censuses of 1955, 1961, 1974, 1984, 1996 and 2... City Proper -4.26583 15.28318 2019 ... brazzaville 326076.0 0.141276 326076.0 0.141276 2007 1950 Census 11 Population Projection
4 Africa 178 congo 20849 pointe-noire Censuses of 1958, 1974, 1984, 1996 and 2007; S... City Proper -4.77609 11.86352 2019 ... pointe-noire 143513.0 0.122085 143513.0 0.122085 2007 1962 Census 11 Population Projection

5 rows × 33 columns

Key Error Variable¶

The key variable we will focus on is UN_proj_error_%. This variable is defined mathematically below:

$UN\_proj\_error\_\% = \frac{population\_UN - population\_measured}{population\_UN}$

This expresses the prediction error (the difference between the UN population estimate and the measured population) as a fraction of the original UN estimate. For example, if the UN predicted a population of 1,000 and the measured population was 500, then the error is 1,000 - 500 = 500. We then divide this by the original UN estimate of 1,000 to get a UN_proj_error_% of 0.5 or 50%.

A positive value means the UN over-estimated the population and a negative value means the UN under-estimated the population.
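This calculation can be sketched with toy numbers (the column names below mirror the dataset; the values are illustrative only):

```python
import pandas as pd

# Minimal sketch of the error columns. 'population_UN' is the UN prediction,
# 'population' the measured figure; values here are toy numbers.
df = pd.DataFrame({
    'population_UN': [1000, 800],
    'population':    [500, 1000],
})
# Raw error: positive = over-estimate, negative = under-estimate
df['UN_proj_error'] = df['population_UN'] - df['population']
# Error as a fraction of the UN prediction
df['UN_proj_error_%'] = df['UN_proj_error'] / df['population_UN']
# Magnitude, ignoring the direction of the error
df['mag_UN_proj_error_%'] = df['UN_proj_error_%'].abs()
print(df['UN_proj_error_%'].tolist())  # [0.5, -0.25]
```

The magnitude column drops the sign so that over- and under-estimates can be compared by size alone.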

We decided to use this rather than the raw error value (the difference between the UN predicted population and the measured city population) because we strongly suspect the error will be proportional to the size of the UN Population estimate (e.g. a projection of 1,000,000 will likely have a larger typical raw error figure than a projection of 400,000).

By finding an error value that is expressed as a fraction of the UN population estimate, we remove the variation in error due to the size of the UN population. This way, we can hopefully find a typical figure e.g. 15% that all UN predictions vary according to, regardless of how large or small they are.

Below, we create 2 plots to validate our assumptions:

  1. Raw error against the UN projected population
  2. UN_proj_error_% against UN projected population
In [18]:
fig1 = px.scatter(cities, y = "mag_UN_proj_error", 
                  x = 'population_UN', hover_name='City', 
                  range_x=(0, 6000000), title = 'Raw Error vs Projected UN Population')
fig1.show()
In [19]:
# Note we use the magnitude of the raw error, because we are not 
# interested in whether they are over or under estimates, only if the 
# size of the error correlates with the size of the population predicted
fig2 = px.scatter(cities, y = "mag_UN_proj_error_%", x = 'population_UN',
                  hover_name='City', range_x=(0, 6000000),
                 title = 'Error as Fraction of UN Population vs Projected UN Population')
fig2.show()

Observations:

  • We can clearly see a correlation between the raw error and the UN projected population. Larger population estimates can have much larger errors.
  • The UN_proj_error_% shows no clear relationship with the UN population - it is independent of the size of the prediction.

Concerns about Measured Population Figures¶

The measured city data was sourced from https://citypopulation.de/ which provided 3 types of population estimate (defined here):

  • Population Census - Directly measured and / or adjusted according to a nation wide census.
  • Population Estimate - Calculated using the current birth and death rates and the estimated migration or they are based on population registers.
  • Population Projections - Are calculations referring to the future, their accuracy is often poor.

We believe the Population Census figures are typically reliable. We assume they are a source of truth, which we can directly compare to UN population predictions. We are less sure about the Population Estimates and Population Projections. Since we have relatively few census population figures to use (because only a small number of countries have had a census since 2018), we think it is worth including these other possibly less reliable figures.

We will start our analysis by comparing the 3 different estimate types to see if they show significantly different error trends. If they do, it may imply we should be careful when using them to estimate the accuracy of the UN projections.

Comparing Estimate Types¶

Below, we quantify the number of observations we have within each of the 3 city population estimate categories:

In [20]:
fig3 = px.histogram(cities, x = 'data_source_type', 
                    title = 'Count of Population Figures per Estimate Type', 
                   color = 'data_source_type')
fig3.show()
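The same counts can be tabulated directly with `value_counts` (a toy DataFrame stands in for the real `cities` data here):

```python
import pandas as pd

# Toy stand-in for the real 'cities' DataFrame, which has a
# 'data_source_type' column.
cities_demo = pd.DataFrame({'data_source_type':
    ['Population Census'] * 2 + ['Population Estimate'] * 3})
counts = cities_demo['data_source_type'].value_counts()
print(counts)
```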

Observations:

  • The majority of figures are Population Census (19 total) and Population Estimates (23 total), which we suspect are the more reliable figures.
  • We have relatively few Population Projections (12 total)

We now want to identify if the different estimate types show similar or different error trends. Firstly, we plot a histogram of the UN_projerror%, where we color each of the different estimate types separately:

In [21]:
fig4 = px.histogram(cities, x="UN_proj_error_%",
                    color = 'data_source_type', 
                    title = 'Distribution of Projection Error per Population Estimate Type')
fig4.show()

Observations:

  • The overall distribution looks roughly normally distributed, with some large outlying values of around 60%.
  • The Population Projection figures show the least variation and are closely spread around 0. This may suggest the population projections have relatively small errors because they follow a similar logic / methodology to the UN projections and are thus not really testing them.
  • The Population Estimate and Population Census are more spread out, with Estimates tending to lead to under-estimates (negative values) and census data tending to lead to over-estimates (positive values). This could suggest the estimates are generally overestimating the real populations (since it is shifted relative to what we suspect are the 'True' or trustworthy census values).
  • The spread in values for the two aforementioned estimate types may imply they reflect more real-life variation from the projections and are therefore reliable.

We will now look at a box plot of absolute error values (no negatives) to view the data in a different way:

In [22]:
fig5 = px.box(cities, y="mag_UN_proj_error_%",  
              color = 'data_source_type', 
             title = 'Projection Error Box Plots for Different Population Estimate Types')
fig5.show()

Observations:

  • We can clearly see that the Population Census data has the largest spread. The data is concentrated around a larger typical error value, i.e. Population Census results lead to typically larger errors.
  • It looks like the Population Estimates and Population Projections have similar spreads, although the Population Estimate data is concentrated around slightly higher errors. Overall, we see a clear trend of lower typical errors and less spread of errors as we go from left to right.
  • If we assume that correct population figures are likely to show more spread and larger errors because the models will fail to account for the complexity and nuance of city population growth, then perhaps this means we can rely on the Census data most and should be increasingly skeptical of results from the Population Estimates and Population Projections.

Finally, we will calculate some summary statistics for each estimate type and see what conclusions we can draw from them:

In [23]:
metrics = ['mean', 'median', 'std', 'max', 'min']
cities.groupby('data_source_type')['UN_proj_error_%'].agg(metrics)
Out[23]:
mean median std max min
data_source_type
Population Census -0.032778 0.021041 0.234340 0.507310 -0.531148
Population Estimate -0.079687 -0.084522 0.184166 0.292731 -0.627823
Population Projection 0.073150 0.084221 0.127437 0.261639 -0.179335

Observations:

  • If the median is positive, it means that the majority of observations are over-estimates. If it is negative, the majority are under-estimates. We can see that for both Population Census and Population Projection the median is positive, meaning that most predictions were over-estimates. This agrees with the histogram above. Population Estimates had a negative median and thus the majority of projections were under-estimates. This could imply the population estimates are typically over-estimating the population, leading to the apparent under-estimates from the UN predictions.
  • The mean is less useful / easy to understand, however it may weight the typical behaviour of under or over estimating according to the size of the over-estimate (e.g. perhaps there are fewer under-estimates but the ones that do occur are much larger). We can see that for Population Census the mean is negative and has a similar magnitude to the median (which is positive). This may suggest that while the majority of projections are over-estimates, some of the under-estimates are very large and pull the mean negative. For Population Estimates and Population Projections we see the mean and median are almost identical, suggesting that outliers have little impact.
  • Looking at standard deviation and min / max, we see that the Population Census errors have more spread and more outliers than Population Estimates or Population Projections.
  • Overall, the means and medians are typically small (~2.5% for Population Census and ~8% for Population Estimates and Population Projections). This suggests no strong bias towards under- or over-estimation. Given the small number of observations, any measured bias could easily be a statistical fluctuation.

Caution - the small mean and median values for Population Census are not measures of typical accuracy. The negative and positive values cancel each other out, so these statistics measure balance (bias) rather than typical error size. Performing the same calculation on the magnitude of the error yields very different results.
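A toy illustration of this cancellation effect (illustrative numbers, not the real data):

```python
import numpy as np

# Symmetric over- and under-estimates cancel in the signed mean
# (a measure of balance) but not in the mean of magnitudes
# (a measure of typical error size).
errors = np.array([0.2, -0.2, 0.15, -0.15])
signed_mean = errors.mean()           # ~0: no overall bias
typical_size = np.abs(errors).mean()  # ~0.175: typical error magnitude
print(signed_mean, typical_size)
```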

Conclusions:

  • It seems clear that the Population Census data behaves quite differently to the Population Estimate and Population Projection data - It has a larger spread and typically larger errors.
  • We will treat population census data as the most reliable.
  • We will treat the Population Estimate and Population Projection data as fairly questionable. We believe the Population Estimates may be more reliable, but not by much, if at all.

What is the typical UN Projection Error?¶

In this section, we will try to estimate the typical projection error in a UN population prediction. This typical error will be used by city planners to help them understand how reliable an estimate is likely to be. As such, we will be looking to create a conservative estimate that captures the largest likely error. This will hopefully allow city planners to bear in mind and prepare for worst case large errors in the prediction which could cause major problems in city planning.

This means we will be looking for the largest error boundary that would capture the majority of population predictions. It will not, however, include outliers. We have explored 2 outliers to understand how they may be interpreted. In one case (Freetown in Sierra Leone) the census data has been called into question because it showed the population halving since the last census. In the other, it seems the UN data was simply quite wrong. Either way, the typical error will be a measure of the maximum width of typical (i.e. the most common) projection errors.

We will create a rough estimate of the typical error by calculating the same statistical metrics above but this time on the magnitude of the error:

In [24]:
cities.groupby('data_source_type')['mag_UN_proj_error_%'].agg(metrics)
Out[24]:
mean median std max min
data_source_type
Population Census 0.174853 0.150237 0.154198 0.531148 0.021041
Population Estimate 0.143775 0.109804 0.137640 0.627823 0.000153
Population Projection 0.117446 0.115836 0.083847 0.261639 0.024128

Observations:

  • As expected, we see that Population Census has the largest mean and median values, implying the largest typical error in predictions. It has a mean and median of 17% and 15% respectively, suggesting a typical error of ~16%. This is relatively large, especially for bigger UN population predictions (remember this error is a fraction of the UN population projection, so for a prediction of 500,000 the error would be +- 80,000, and for a projection of 1,000,000 it would double to +- 160,000). From a city planning perspective this seems like quite a large error: a city growing by 10% alone would be significant, let alone having this margin of error in the prediction.
  • We see a maximum error of about 50% and a minimum of 2%, showing that the predictions can be quite accurate (2% seems very respectable).

The mean and median estimates of error are only rough estimates. They do not clearly define the number of projections whose errors sit within a given threshold (e.g. the maximum error percentage for 80% of UN projections).
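The quantity we are after can be sketched directly in pandas (toy error magnitudes standing in for the real `mag_UN_proj_error_%` column):

```python
import pandas as pd

# Share of projections whose error magnitude falls within a given boundary.
# Toy values stand in for the real 'mag_UN_proj_error_%' column.
errors = pd.Series([0.05, 0.18, 0.25, 0.53])
for threshold in (0.2, 0.3):
    share = (errors <= threshold).mean()
    print(f'error <= {threshold:.0%}: {share:.0%} of projections')
```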

To get this data, we create a cumulative distribution plot below on the magnitude of the error:

In [25]:
fig6 = px.histogram(cities, x = 'mag_UN_proj_error_%', cumulative=True, 
                    histnorm='percent', height = 900, 
                    facet_row='data_source_type',
                   title = 'Cumulative Distribution of Projection Error per Population Estimate Type',
                   color = 'data_source_type')
fig6.show()

Observations:

  • Looking at the Population Census data, we can see that 68% of the predictions have an error of 20% or less. This jumps to 89.5% with an error of 30% or less. Thus, we think a typical error of 30% offers a reliable and conservative maximum error to expect from a UN population prediction. The 20% error is more optimistic but may lead to city planners being caught out by a larger error.
  • Both Population Estimate and Population Projection plots show that over 85% of predictions have an error of 20% or less - they show a much more optimistic result than what the Census data shows.

Conclusion:

  • We think that a city planner should anticipate a maximum typical error in a UN population prediction of 30% if you are being conservative or 20% if you wish to be more optimistic.
  • The prediction is likely to have a lower error of around ~16% most of the time and can be relatively small, within single digits.

Visualising The Typical Error¶

In the plots below, we have tried to convey the error in the UN population predictions for every individual prediction.

UN Population Prediction vs Measured Population¶

We will first plot the UN Population Prediction against the Measured Population for every single prediction in the dataset.

If the UN perfectly predicted the population, then the predictions would sit along the y = x straight line from the origin (plotted in blue). The amount a prediction diverges from this straight line is a measure of the error in the prediction. If the prediction over-estimates the population, it will sit above the straight line. If the prediction under-estimates the population, it will sit below it.

We have shaded in the +-30% UN_proj_error_% boundary above and below the straight line that captures the majority of the observations.

Please note - the +-30% boundary is not symmetric across the y = x line. This may seem wrong and unintuitive at first, but we are confident it is correct.

This behaviour is due to the way the percentage error is defined. Recall that the percentage error was defined as the difference between the projected population and the measured population divided by the projected population:

$UN\_proj\_error\_\% = \frac{population\_UN - population\_measured}{population\_UN}$

Dividing by the projected population is what causes this behaviour. If we move vertically up from the y = x line, the Projected Population keeps increasing. This means that you need a larger raw error (vertical distance from y = x) to get to 30% (because you are dividing by an ever larger number to get the percentage). As you move vertically down from y = x, the projected population is decreasing and thus you are dividing by an ever smaller number. Thus, you reach the raw error required for a 30% threshold faster.

If you look carefully, you will see that the horizontal distance between y = x and the boundary is always the same on either side. This is because a horizontal slice represents the population projection staying constant, and thus it takes the same size of raw error to reach 30% on either side.
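This asymmetry can be checked numerically (toy measured population; the 30% figure matches the plotted boundary):

```python
# At a fixed measured population, the vertical (raw-error) distance from
# y = x up to the +30% boundary is larger than the distance down to the
# -30% boundary. 'm' is a toy measured population.
m = 1_000_000
upper = m / (1 - 0.3)  # UN prediction sitting on the +30% line
lower = m / (1 + 0.3)  # UN prediction sitting on the -30% line
print(upper - m)  # distance above y = x (~428,571)
print(m - lower)  # distance below y = x (~230,769)
```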

The equations for the +-30% boundary lines are derived in the Appendix.

In [26]:
# Define data that will be available when you hover over a data point
hover_data = ['Country_or_area', 'year', 'population_UN', 'population', 
              'UN_proj_error', 'UN_proj_error_%','data_source_type', 
              'years_since_UN_projection', 'months_offset_from_midyear', 
              'most_recent_UN_source', 'years_since_latest_UN_source']
# Updated html code of the hover info to make it more readable and 
# change the order. 
hover_template = '<b>%{hovertext}</b> - %{customdata[0]} (%{customdata[1]})<br><br>population_UN=%{y}<br>population=%{x}<br>UN_proj_error=%{customdata[2]}<br>UN_proj_error_%=%{customdata[3]}<br>data_source_type=%{customdata[4]}<br>years_since_UN_projection=%{customdata[5]}<br>months_offset_from_midyear=%{customdata[6]}<br>most_recent_UN_source=%{customdata[7]}<br>years_since_latest_UN_source=%{customdata[8]}<extra></extra>'
hover_template = hover_template.replace('=', ' = ')

labels = {'population_UN':'UN Predicted Population', 
          'population':'Measured Population', 
          'data_source_type':'Measured Population Estimate Type'}

# Create scatter plot
fig7 = px.scatter(cities, x = "population", y = 'population_UN', 
                  color = 'data_source_type', height = 600, 
                  hover_name = 'City', hover_data = hover_data,
                 title = 'UN Predicted Population vs Measured Population', 
                 labels = labels)
# Update hover info background color and font
fig7.update_layout(
    hoverlabel=dict(
        bgcolor="white",
        font_size=10,
        font_family="Rockwell"
    )
)
# Update html of hover data in plot
for data in fig7.data:
    data.hovertemplate = hover_template
# calculate maximum population that will be plotted
pop_max = cities.population.max()
extra = 2500000
# plot y = x line from 0 to max_pop_plotted which is a bit beyond the 
# largest population that will be plotted. 
max_pop_plotted = pop_max + extra
# Define the error boundary we will plot (e.g. set as 0.3 to plot 
# a +-30% error boundary)
error = 0.3
# We will plot straight lines above and below the y = x straight line.
# These lines will show the position of the +- 30% error region (anything 
# outside these lines has an error larger than 30%). They will both go 
# through the origin. We need to define the position of the rightmost point
# on each line (the end of the line). Plotly will connect this point to the 
# origin to draw the line. 

# the rightmost point must be expressed in terms of x and y, which in this 
# case means Population_UN and population. pop_max defines the x coordinate 
# of the max point. The below equations define the y coordinate (population_UN)
# of the rightmost point for the 2 lines (pos for above and neg for below y = x). 
max_plotted_error_pos = max_pop_plotted / (1 - error)
max_plotted_error_neg = max_pop_plotted / (1 + error)
# define line for the +-30% lines (it will be invisible)
invisible_line = go.scatter.Line(color='rgb(0,0,0)', width = 0)
# define line quality for the y = x line (it will be blue)
visible_line = go.scatter.Line(color='blue')
# create marker object to hide markers at the end of the lines
marker = go.scatter.Marker(opacity =0)
# Define the transparent color of the +- 30% error region
fillcolor = 'rgba(0,50,0,0.1)'

# add invisible +30% error region line, not shown in legend
fig7.add_trace(
    go.Scatter(x=[0, max_pop_plotted], y=[0,max_plotted_error_pos], 
               name="error_over", line_shape='linear', opacity = 0, 
               line = invisible_line, marker = marker, showlegend=False)
)
# add invisible y = x line, and fill between this line and the +30% line above
fig7.add_trace(
    go.Scatter(x=[0, max_pop_plotted], y=[0,max_pop_plotted], 
               name="+30% Prediction Error (over-estimate)", 
               line_shape='linear', line= invisible_line, fill='tonexty', 
               opacity = 0.01, marker = marker, fillcolor = fillcolor)
)
# Add invisible -30% error region line, not shown in legend
fig7.add_trace(
    go.Scatter(x=[0, max_pop_plotted], y=[0,max_plotted_error_neg], 
               name="error_under", line_shape='linear', opacity = 0.01, 
               line=invisible_line,marker = marker, showlegend=False)
)
# add invisible y = x line, and fill between this line and the -30% line below
fig7.add_trace(
    go.Scatter(x=[0, max_pop_plotted], y=[0,max_pop_plotted], 
               name="-30% Prediction Error (under-estimate)", 
               fillcolor= fillcolor, line_shape='linear', line= invisible_line,
               fill='tonexty' ,opacity = 0.01,marker = marker)
)
# add visible blue y = x line
fig7.add_trace(
    go.Scatter(x=[0, max_pop_plotted], y=[0,max_pop_plotted], 
               name="Perfect Population Prediction", fillcolor= fillcolor,
               line_shape='linear', line= visible_line ,marker = marker)
)

fig7.show()

Observations:

  • It appears that predictions of smaller populations are more likely to have an error outside the +-30% region. Perhaps we should consider defining multiple errors: a larger error for smaller populations and a smaller one for larger populations. It seems plausible that larger population estimates benefit from the law of large numbers and tend to behave in a more predictable way than smaller populations.
  • This plot may be a bit hard to interpret / understand at a glance.
  • It is essential to zoom in to the smaller population area of the plot using the interactive plotly functionality, or else you cannot see much.

Projection Error vs Population Prediction¶

Below we directly plot the projection error against the predicted population. The x axis uses a log scale to spread out the points more so the plot is more readable (otherwise the points are concentrated towards the left side of the chart and are hard to read). This is a simpler plot that easily and intuitively conveys the +- 30% error boundaries. The 20% and 30% error boundaries have been indicated via highlights. A marginal violin plot is shown on the right hand side. This provides a simplified view of the distribution of the errors and shows how the 20% and 30% boundaries capture different amounts of the projections we have tested:

In [27]:
labels = {'population_UN':'UN Predicted Population', 
          'UN_proj_error_%':'Error In Prediction', 
          'data_source_type':'Measured Population Estimate Type'}

# Updated html code of the hover info to make it more readable and 
# change the order. 
hover_template = '<b>%{hovertext}</b> - %{customdata[0]} (%{customdata[1]})<br><br>population_UN=%{x}<br>population=%{customdata[2]}<br>UN_proj_error=%{customdata[3]}<br>UN_proj_error_%=%{y}<br>data_source_type=%{customdata[4]}<br>years_since_UN_projection=%{customdata[5]}<br>months_offset_from_midyear=%{customdata[6]}<br>most_recent_UN_source=%{customdata[7]}<br>years_since_latest_UN_source=%{customdata[8]}<extra></extra>'
hover_template = hover_template.replace('=', ' = ')

# Create scatter plot
fig8 = px.scatter(cities, y="UN_proj_error_%", x = 'population_UN', 
                  color = 'data_source_type', marginal_y = 'violin', 
                  log_x=True, height = 600, hover_name = 'City', 
                  hover_data = hover_data, labels = labels, 
                 title = 'Projection Error vs Population Prediction')
# Set y axis to show percentage e.g. 0.3 becomes 30%
fig8.layout.yaxis.tickformat = ',.0%'

# Update hover info background color and font
fig8.update_layout(
    hoverlabel=dict(
        bgcolor="white",
        font_size=10,
        font_family="Rockwell"
    )
)
# Update html of hover data in plot
for data in fig8.data:
    data.hovertemplate = hover_template
# Add the +-30% error region as rectangle shading
fig8.add_hrect(y0=-0.3, y1=0.3, fillcolor="green", opacity=0.25,
               line_width=0, 
               annotation_text="<b>20% - 30% </b>Error Region", 
               annotation_position="inside bottom right",
               annotation=dict(font_size=12, 
                               font_family="Rockwell", font_color = 'green'))
# Add the +-20% error region as rectangle shading
fig8.add_hrect(y0=-0.2, y1=0.2, fillcolor="orange", opacity=0.25, 
               line_width=0, 
               annotation_text="<b>20%</b> Error Region", 
               annotation_position="inside top right", 
               annotation=dict(font_size=12, font_family="Rockwell", 
                               font_color = 'yellow'))
# Remove annotation text from marginal plot
fig8.layout.annotations = (fig8.layout.annotations[0], 
                           fig8.layout.annotations[2])
fig8.show()

Observations:

  • As before, you can almost see a narrowing down in the typical percentage error as the population predicted increases. However, we don't think we have enough observations to say this with confidence or statistical rigor.
  • This is a much simpler plot and it seems like it communicates the key information in a more direct and easy to understand way.
  • There is no need to zoom in on this plot to see the full detail: all the information you need is immediately visible.

Other Insights¶

In [28]:
fig9 = px.box(cities.sort_values('years_since_UN_projection'), 
              y="UN_proj_error_%", 
              facet_row = 'City_Definition', height = 1000)
fig9.show()

Save Key Plots¶

In [29]:
fig7.write_html('UN Population Prediction vs Measured Population.html')
fig8.write_html('Projection Error vs Population Prediction.html')

Save Plotly data so the interactive plots work on nbviewer.

In [30]:
# make notebook interactive on nb-viewer
import plotly

plotly.offline.init_notebook_mode()

Appendix¶

Derivation of 30% Error Boundaries¶

Here we document the derivation of the 2 straight line equations that define the +-30% error boundary on the UN projected Population vs Measured Population plot:

Recall that the percentage error was defined as the difference between the projected population and the measured population divided by the projected population:

$UN\_proj\_error\_\% = \frac{population\_UN - population\_measured}{population\_UN}$

We are looking for the +-30% error boundary, so we set the UN_proj_error_% to +-0.3:

$\pm0.3 = \frac{population\_UN - population\_measured}{population\_UN}$

And now we re-arrange the equation into a straight line format $y = mx$ such that we are expressing the y axis (population_UN) in terms of the x axis (population_measured) multiplied by a gradient term m.

$population\_UN_{\pm 30\%} = \frac{population\_measured}{1\mp0.3}$

(Note the sign flip in the denominator: the +30% line divides by 1 - 0.3 and the -30% line by 1 + 0.3, matching the plotting code above.)

Any point on either of these lines has a combination of Population_UN and population_measured that results in a 30% error (either above or below the 0% error population_UN = population_measured line).
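A quick numeric check of these boundary equations (toy measured population):

```python
# Toy check: a point constructed to sit on each boundary line should
# reproduce exactly the target +-0.3 error under the error definition.
population_measured = 500_000
for target_error in (0.3, -0.3):
    # boundary line: population_UN = population_measured / (1 - error)
    population_UN = population_measured / (1 - target_error)
    error = (population_UN - population_measured) / population_UN
    print(target_error, round(error, 12))  # the two values agree
```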